Abstract

We develop a new collaborative filtering (CF) method that combines both previously known 
users' preferences, i.e. standard CF, as well as product/user attributes, i.e. classical function 
approximation, to predict a given user's interest in a particular product. Our method is a 
generalized low rank matrix completion problem, where we learn a function whose inputs are 
pairs of vectors - the standard low rank matrix completion problem being a special case where 
the inputs to the function are the row and column indices of the matrix. We solve this generalized 
matrix completion problem using tensor product kernels for which we also formally generalize 
standard kernel properties. Benchmark experiments on movie ratings show the advantages of 
our generalized matrix completion method over the standard matrix completion one with no 
information about movies or people, as well as over standard multi-task or single task learning 
methods. 



1 Introduction 

Collaborative Filtering (CF) refers to the task of predicting preferences of a given user based on 
their previously known preferences as well as the preferences of other users. In a book recommender 
system, for example, one would like to suggest new books to a customer based on what he and others 
have recently read or purchased. This can be formulated as the problem of filling a matrix with 
customers as rows, objects (e.g., books) as columns, and missing entries corresponding to preferences 
that one would like to infer. In the simplest case, a preference could be a binary variable (thumbs 
up/down), or perhaps even a more quantitative assessment (scale of 1 to 5). 

Standard CF assumes that nothing is known about the users or the objects apart from the preferences
expressed so far. In such a setting the most common assumption is that preferences can be
decomposed into a small number of factors, both for users and objects, resulting in the search for a
low-rank matrix which approximates the partially observed matrix of preferences. This problem is
usually a difficult non-convex problem for which only heuristic algorithms exist [14]. Alternatively,
convex formulations have been obtained by relaxing the rank constraint into a constraint on the trace
norm of the matrix [15].

In many practical applications of CF, however, a description of the users and/or the objects through 
attributes (e.g., gender, age) or measures of similarity is available. In that case it is tempting to 
take advantage of both known preferences and descriptions to model the preferences of users. An 
important benefit of such a framework over pure CF is that it potentially allows the prediction of 
preferences for new users and/or new objects. Seen as learning a preference function from examples, 
this problem can be solved by virtually any algorithm for supervised classification or regression 
taking as input a pair (user, object). If we suppose for example that a positive definite kernel 
between pairs can be deduced from the description of the users and object, then learning algorithms 
like support vector machines or kernel ridge regression can be applied. These algorithms minimize 
an empirical risk over a ball of the reproducing kernel Hilbert space (RKHS) defined by the pairwise 
kernel. 

Both the rank constraint and the RKHS norm restriction act as regularization based on prior
hypotheses about the nature of the preferences to be inferred. The rank constraint is based on the
hypothesis that preferences can be modelled by a limited number of factors describing users and
objects. The RKHS norm constraint assumes that preferences vary smoothly between similar users
and similar objects, where the similarity is assessed in terms of the kernel for pairs.

The main contribution of this work is to propose a framework which combines both regularizations 
on the one hand, and which interpolates between the pure CF approach and the pure attribute- 
based approaches on the other hand. In particular, the framework encompasses low-rank matrix 
factorization for collaborative filtering, multi-task learning, and classical regression/classification 
over product spaces. We show on a benchmark experiment of movie recommendations that the 
resulting algorithm can lead to significant improvements over other state-of-the-art methods. 



2 Kernels and tensor product spaces 

In this section, we review the classical theory of tensor product reproducing kernel Hilbert spaces,
providing a natural generalization of finite-dimensional matrices for functions of two variables. The
general setup is as follows. We consider the general problem of estimating a function
$f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ given a finite set of observations in
$\mathcal{X} \times \mathcal{Y} \times \mathbb{R}$. We assume that both spaces $\mathcal{X}$ and
$\mathcal{Y}$ are endowed with positive semi-definite kernels, respectively
$k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$,
and denote by $\mathcal{K}$ and $\mathcal{G}$ the corresponding RKHS. A typical application of this
setting is where $x \in \mathcal{X}$ is a person, $y \in \mathcal{Y}$ is a movie, the kernels $k$ and $g$
represent similarities between persons and movies, respectively, and $f(x, y)$ represents the rating
given by person $x$ to movie $y$. We note that if $\mathcal{X}$ and $\mathcal{Y}$ are finite sets,
then $f$ is simply a matrix of size $|\mathcal{X}| \times |\mathcal{Y}|$.



2.1 Tensor product kernels and RKHS 

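We denote by $k_\otimes : (\mathcal{X} \times \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y}) \to \mathbb{R}$
the tensor product kernel, known to be a positive definite kernel [1, p. 70]:

$$k_\otimes\left( (x_1, y_1), (x_2, y_2) \right) = k(x_1, x_2)\, g(y_1, y_2), \quad (1)$$

and by $\mathcal{H}_\otimes$ the associated RKHS. A classical result of Aronszajn [5] states that
$\mathcal{H}_\otimes$ is the tensor product of the two spaces $\mathcal{K}$ and $\mathcal{G}$ (denoted
$\mathcal{K} \otimes \mathcal{G}$), i.e., is the completion of the set of all functions
$f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ which can be finitely decomposed as
$f(x, y) = \sum_{k=1}^{p} u_k(x) v_k(y)$, where $u_k \in \mathcal{K}$ and $v_k \in \mathcal{G}$,
$k = 1, \ldots, p$. An atomic term defined as $f(x, y) = u(x) v(y)$, with $u \in \mathcal{K}$ and
$v \in \mathcal{G}$, is usually denoted $f = u \otimes v$. The space is equipped with a norm such that
$\|u \otimes v\|_\otimes = \|u\|_{\mathcal{K}} \times \|v\|_{\mathcal{G}}$, and thus
$\left\| \sum_k u_k \otimes v_k \right\|_\otimes^2 = \sum_{k,l} \langle u_k, u_l \rangle_{\mathcal{K}} \langle v_k, v_l \rangle_{\mathcal{G}}$.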
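
As a small numerical illustration of the product kernel (1), the sketch below builds the kernel matrix on all pairs from two kernel matrices and checks that it is positive semi-definite. The Gaussian kernel, the random attribute vectors and the helper name `gaussian_kernel` are assumptions made for this example only; they are not prescribed by the text.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 hypothetical users, 3 attributes each
Y = rng.normal(size=(5, 2))   # 5 hypothetical objects, 2 attributes each

K = gaussian_kernel(X, X)     # kernel k on users
G = gaussian_kernel(Y, Y)     # kernel g on objects

# Product kernel on pairs: k_prod((x_i, y_j), (x_k, y_l)) = k(x_i, x_k) * g(y_j, y_l).
# With pairs ordered as (i, j) -> i * n_y + j, this is exactly np.kron(K, G).
K_prod = np.kron(K, G)

i, j, k, l = 1, 3, 2, 0
entry = K_prod[i * 5 + j, k * 5 + l]
min_eig = np.linalg.eigvalsh(K_prod).min()
```

Here `entry` coincides with `K[1, 2] * G[3, 0]`, and the smallest eigenvalue is non-negative up to rounding, as expected for a product of positive semi-definite kernels.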



2.2 Rank 

An element of $\mathcal{H}_\otimes$ can always be decomposed as a possibly infinite sum of atomic terms
of the form $u(x) v(y)$ where $u \in \mathcal{K}$ and $v \in \mathcal{G}$. We define the rank
$\operatorname{rank}(f) \in \mathbb{N} \cup \{\infty\}$ of an element $f$ of $\mathcal{H}_\otimes$ as
the minimal number of atomic terms in any decomposition of $f$, i.e., $\operatorname{rank}(f)$ is the
smallest integer $p$ such that $f$ can be expanded as:

$$f(x, y) = \sum_{i=1}^{p} u_i(x) v_i(y),$$

for some functions $u_1, \ldots, u_p \in \mathcal{K}$ and $v_1, \ldots, v_p \in \mathcal{G}$, if such an
integer $p$ does exist (otherwise, the rank is infinite).

When the two RKHS are spaces of linear functions on a Euclidean space, the tensor product
can be identified with the space of bilinear forms on the product of the two Euclidean spaces, and the
notion of rank coincides with the usual notion of rank for matrices. We note that an alternative
characterization of $\operatorname{rank}(f)$ is the supremum of the ranks of the matrices $M$ defined by
$M_{ij} = f(x_i, y_j)$ over the choices of finite sets $x_1, \ldots, x_m \in \mathcal{X}$ and
$y_1, \ldots, y_p \in \mathcal{Y}$ (see technical annex for a proof).
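
The characterization of the rank through sampled matrices can be checked numerically in the linear case. The following is a minimal sketch, assuming linear atomic terms on Euclidean spaces with randomly drawn coefficients (all names and sizes are hypothetical): the matrix of values of f on finite sets of points has the same rank as f as soon as enough generic points are used.

```python
import numpy as np

rng = np.random.default_rng(1)
p, dx, dy = 3, 6, 5

# Linear atomic terms u_i(x) = <a_i, x> and v_i(y) = <b_i, y>, so that
# f(x, y) = sum_i u_i(x) v_i(y) = x^T (A^T B) y is a bilinear form of rank p.
A = rng.normal(size=(p, dx))
B = rng.normal(size=(p, dy))

def f(x, y):
    return float((A @ x) @ (B @ y))

# Evaluate f on finite sets of points and compute the rank of M_ij = f(x_i, y_j).
xs = rng.normal(size=(8, dx))
ys = rng.normal(size=(7, dy))
M = np.array([[f(x, y) for y in ys] for x in xs])

rank_M = np.linalg.matrix_rank(M)
```

With generic points, `rank_M` equals `p`, which is also the rank of the matrix `A.T @ B` of the bilinear form.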



2.3 Trace norm 

Given a rectangular matrix $M$, the rank is not an easy function to optimize or constrain, since it is
neither convex nor continuous. Following the 1-norm approximation to the 0-norm, the trace norm
has emerged as an efficient convex approximation of the rank [5, 15]. The trace norm $\|M\|_*$ is
defined as the sum of the singular values. This definition is not easy to extend to functional tensor
product spaces because it involves eigen-decompositions. Rather, we use the equivalent formulation

$$\|M\|_* = \inf_{M = UV^\top} \frac{1}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right),$$

where $\|U\|_F^2 = \operatorname{tr} UU^\top$ is the squared Frobenius norm.
We thus extend the notion of trace norm as

$$\|f\|_* = \inf_{f = \sum_k u_k \otimes v_k} \frac{1}{2} \sum_k \left( \|u_k\|_{\mathcal{K}}^2 + \|v_k\|_{\mathcal{G}}^2 \right).$$

Lemma 1 This is a norm, equal to the sum of singular values when the two RKHS are spaces of
linear functions on a Euclidean space.

The main attractiveness of the trace norm is its convexity [8, 15]. However, the trace norm does not
readily yield a representer theorem, and as shown in Section 3.3 it is more practical to penalize the
trace norm of the matrix of estimates.
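
The variational formulation of the trace norm for matrices can be verified on a small example. The sketch below (random matrix and variable names are assumptions) builds a balanced factorization from the singular value decomposition, for which half the sum of squared Frobenius norms equals the sum of singular values, and compares it with a rescaled factorization of the same matrix.

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(6, 4))

# Trace norm as the sum of singular values.
P, s, Qt = np.linalg.svd(M, full_matrices=False)
trace_norm = s.sum()

# Balanced factorization M = U V^T built from the SVD: U = P sqrt(S), V = Q sqrt(S).
U = P * np.sqrt(s)
V = Qt.T * np.sqrt(s)
balanced_value = 0.5 * (np.sum(U ** 2) + np.sum(V ** 2))

# Any other factorization of M gives a value that is at least as large,
# for instance the rescaled pair (c U, V / c).
c = 3.0
rescaled_value = 0.5 * (np.sum((c * U) ** 2) + np.sum((V / c) ** 2))
```

The balanced factorization attains the infimum, while the rescaled one gives a strictly larger value for the same product.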

3 Representer theorems



In this section we explicitly state and prove representer theorems in tensor product spaces when a 
functional is minimized with rank constraints. These theorems underlie the algorithms proposed in 
the next section. 

In a collaborative filtering task, the data usually have a matrix form, i.e., many x's (resp. y's) are
identical. We let $x_1, \ldots, x_{n_{\mathcal{X}}}$ denote the $n_{\mathcal{X}}$ distinct values of
elements of $\mathcal{X}$ in the training data, and, respectively, $y_1, \ldots, y_{n_{\mathcal{Y}}}$
denote the $n_{\mathcal{Y}}$ distinct values of elements of $\mathcal{Y}$. We assume that we
have observations of only a subset of $\{1, \ldots, n_{\mathcal{X}}\} \times \{1, \ldots, n_{\mathcal{Y}}\}$.
We thus denote by $i(u)$ and $j(u)$ the indices of the $u$-th observation and by $z_u$ the observed target,
for $u = 1, \ldots, n$.



3.1 Classical representer theorem in the tensor product RKHS 

Given a loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$, for example the square loss
function $\ell(z, z') = (z - z')^2$, a first classical approach to learn dependencies between the pair
$(x, y)$ and the variable $z$ is to consider it as a supervised learning problem over the product space
$\mathcal{X} \times \mathcal{Y}$, and for example to search for a function in the RKHS of the product
kernel which solves the following problem:

$$\min_{f \in \mathcal{H}_\otimes} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( f(x_{i(u)}, y_{j(u)}), z_u \right) + \lambda \|f\|_\otimes^2 \right\}. \quad (2)$$

By the representer theorem [9] the solution of (2) has an expansion of the form:

$$f(x, y) = \sum_{u=1}^{n} \alpha_u k_\otimes\left( (x_{i(u)}, y_{j(u)}), (x, y) \right) = \sum_{u=1}^{n} \alpha_u k(x_{i(u)}, x)\, g(y_{j(u)}, y),$$

for some vector $(\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n$. Note that the number of parameters
$\alpha$ is the number of observed values. For many loss functions the problem (2) boils down to
classical machine learning algorithms such as support vector machines, kernel logistic regression or
kernel ridge regression, which can be solved by usual implementations with the product kernel (1).
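
For the square loss, problem (2) is kernel ridge regression on the observed pairs and has a closed-form solution. The following is a minimal sketch under that assumption; the random kernel matrices, the observation pattern and the helper names `objective` and `predict` are hypothetical and only serve to illustrate the expansion given by the representer theorem.

```python
import numpy as np

rng = np.random.default_rng(3)
n_x, n_y, n_obs, lam = 6, 5, 12, 0.1

# Hypothetical positive semi-definite kernel matrices on users and objects.
Ax = rng.normal(size=(n_x, 3))
Ay = rng.normal(size=(n_y, 3))
K = Ax @ Ax.T
G = Ay @ Ay.T

# Observed entries: row index i(u), column index j(u), target z_u.
obs = rng.choice(n_x * n_y, size=n_obs, replace=False)
i_idx, j_idx = np.divmod(obs, n_y)
z = rng.normal(size=n_obs)

# Product kernel restricted to the observed pairs (an n_obs x n_obs matrix).
K_obs = K[np.ix_(i_idx, i_idx)] * G[np.ix_(j_idx, j_idx)]

# Square loss: (1/n) ||K_obs a - z||^2 + lam * a^T K_obs a is minimized by
# a = (K_obs + n * lam * I)^{-1} z, i.e. kernel ridge regression.
alpha = np.linalg.solve(K_obs + n_obs * lam * np.eye(n_obs), z)

def objective(a):
    return np.mean((K_obs @ a - z) ** 2) + lam * a @ K_obs @ a

def predict(i, j):
    """f(x_i, y_j) = sum_u alpha_u k(x_i(u), x_i) g(y_j(u), y_j)."""
    return float(np.sum(alpha * K[i_idx, i] * G[j_idx, j]))
```

The vector `alpha` has one coefficient per observed value, in line with the remark on the number of parameters.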

3.2 Representer theorem with rank constraint 

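In order to take advantage of the possible representation of our predictor as a sum of a small number
of factors, we propose to consider the following generalization of (2):

$$\min_{f \in \mathcal{H}_\otimes,\ \operatorname{rank}(f) \le p} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( f(x_{i(u)}, y_{j(u)}), z_u \right) + \lambda \|f\|_\otimes^2 \right\}. \quad (3)$$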



As the following proposition shows, the solution of this constrained minimization problem can also 
be reduced to a finite-dimensional optimization problem (see a proof in the technical annex): 

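Proposition 1 The optimal solution of (3) can be written as $f = \sum_{i=1}^{p} u_i \otimes v_i$, where

$$u_i(x) = \sum_{l=1}^{n_{\mathcal{X}}} \alpha_{li}\, k(x_l, x) \quad \text{and} \quad v_i(y) = \sum_{l=1}^{n_{\mathcal{Y}}} \beta_{li}\, g(y_l, y), \quad i = 1, \ldots, p, \quad (4)$$

where $\alpha \in \mathbb{R}^{n_{\mathcal{X}} \times p}$ and $\beta \in \mathbb{R}^{n_{\mathcal{Y}} \times p}$.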
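
This proposition is a crucial contribution of this paper as it allows us to learn the function $f$ by
learning the coefficients $\alpha$ and $\beta$: denoting by $\alpha_k$ the $k$-th column of $\alpha$
(similarly for $\beta$), we obtain from Proposition 1 that an equivalent formulation of (3) is:

$$\min_{\alpha \in \mathbb{R}^{n_{\mathcal{X}} \times p},\ \beta \in \mathbb{R}^{n_{\mathcal{Y}} \times p}} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( \sum_{k=1}^{p} (K\alpha_k)_{i(u)} (G\beta_k)_{j(u)},\ z_u \right) + \lambda \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i^\top K \alpha_j\, \beta_i^\top G \beta_j \right\}, \quad (5)$$

where $K$ is the $n_{\mathcal{X}} \times n_{\mathcal{X}}$ kernel matrix for the elements of $\mathcal{X}$
(similarly for $G$). In order to link with the trace norm formulation in the next section, if we denote
$\gamma = \sum_{k=1}^{p} \alpha_k \beta_k^\top$, we note that the $n_{\mathcal{X}} \times n_{\mathcal{Y}}$
matrix of predicted values is equal to $F = K \gamma G$, resulting in the following optimization
problem:

$$\min_{\gamma \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}},\ \operatorname{rank}(\gamma) \le p} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( (K \gamma G)_{i(u), j(u)},\ z_u \right) + \lambda \operatorname{tr} \gamma^\top K \gamma G \right\}. \quad (6)$$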
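
The equivalence between the factor parameterization and the matrix parameterization can be checked numerically. The sketch below uses random symmetric positive semi-definite matrices in place of the kernel matrices and random coefficients (all of them assumptions for illustration); it compares the predictions and the regularization term written with the columns of the two coefficient matrices against the same quantities written with the matrix obtained as the sum of the outer products of those columns.

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, n_y, p = 5, 4, 2

Ax = rng.normal(size=(n_x, n_x))
Ay = rng.normal(size=(n_y, n_y))
K = Ax @ Ax.T                      # kernel matrix on the x's
G = Ay @ Ay.T                      # kernel matrix on the y's
alpha = rng.normal(size=(n_x, p))  # coefficients of u_1, ..., u_p
beta = rng.normal(size=(n_y, p))   # coefficients of v_1, ..., v_p

# Factor form: predictions sum_k (K alpha_k)_i (G beta_k)_j and squared norm
# sum_{i,j} (alpha_i^T K alpha_j) (beta_i^T G beta_j).
F_factors = (K @ alpha) @ (G @ beta).T
norm_factors = np.sum((alpha.T @ K @ alpha) * (beta.T @ G @ beta))

# Matrix form with gamma = sum_k alpha_k beta_k^T.
gamma = alpha @ beta.T
F_gamma = K @ gamma @ G
norm_gamma = np.trace(gamma.T @ K @ gamma @ G)
```

Both forms give the same predictions and the same penalty, and `gamma` has rank at most `p` by construction.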



3.3 Representer theorems and trace norm 

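The trace norm does not readily lead to a representer theorem and a finite dimensional optimization
problem. It is thus preferable to penalize the trace norm of the matrix of predicted values
$F_{ij} = f(x_i, y_j)$, and minimize

$$\min_{f \in \mathcal{H}_\otimes} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( f(x_{i(u)}, y_{j(u)}), z_u \right) + \mu \left\| \left( f(x_i, y_j) \right)_{i,j} \right\|_* + \lambda \|f\|_\otimes^2 \right\}. \quad (7)$$

We have the following representer theorem (whose proof is postponed to the technical annex) for
the problem (7):

Proposition 2 The optimal solution of (7) can be written in the form
$f(x, y) = \sum_{i=1}^{n_{\mathcal{X}}} \sum_{j=1}^{n_{\mathcal{Y}}} \gamma_{ij}\, k(x, x_i)\, g(y, y_j)$,
where $\gamma \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}}$.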

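The optimization problem can thus be rewritten as:

$$\min_{\gamma \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}}} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( (K \gamma G)_{i(u), j(u)},\ z_u \right) + \mu \|K \gamma G\|_* + \lambda \operatorname{tr} \gamma^\top K \gamma G \right\}. \quad (8)$$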



Note that in contrast to the finite representation without any constraint on the rank or the trace 
norm (where the number of parameters is the number of observed values), the number of parameters 
is the total number of elements in the matrix, and this method, though convex, is thus of higher 
computational complexity. 
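
To make the convex formulation concrete, the following sketch evaluates the trace-norm penalized objective for the square loss and checks midpoint convexity on two random coefficient matrices. The kernel matrices, the observation pattern and the values of the two regularization parameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n_x, n_y, n_obs = 5, 4, 8
lam, mu = 0.1, 0.5

Ax = rng.normal(size=(n_x, n_x))
Ay = rng.normal(size=(n_y, n_y))
K, G = Ax @ Ax.T, Ay @ Ay.T
obs = rng.choice(n_x * n_y, size=n_obs, replace=False)
i_idx, j_idx = np.divmod(obs, n_y)
z = rng.normal(size=n_obs)

def objective(gamma):
    """Square-loss fit + mu * trace norm of predictions + lam * RKHS norm."""
    F = K @ gamma @ G
    fit = np.mean((F[i_idx, j_idx] - z) ** 2)
    rkhs = np.trace(gamma.T @ K @ gamma @ G)
    return fit + mu * np.linalg.norm(F, "nuc") + lam * rkhs

g1 = rng.normal(size=(n_x, n_y))
g2 = rng.normal(size=(n_x, n_y))
midpoint_gap = 0.5 * (objective(g1) + objective(g2)) - objective(0.5 * (g1 + g2))
n_params = g1.size   # n_x * n_y parameters, versus n_obs without rank or trace norm
```

The gap is non-negative, as convexity requires, and the parameter count is the full matrix size rather than the number of observations.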



3.4 Reformulation in terms of Kronecker products 

Kronecker products Given a matrix $B \in \mathbb{R}^{m \times n}$ and a matrix
$C \in \mathbb{R}^{p \times q}$, the Kronecker product $A = B \otimes C$ is a matrix in
$\mathbb{R}^{mp \times nq}$ defined by blocks of size $p \times q$ where the block $(i, j)$ is $b_{ij} C$.

The most important properties are the following (where it is assumed that all matrix operations are
well-defined): $(A \otimes B)(C \otimes D) = AC \otimes BD$, $(A \otimes B)^{-1} = A^{-1} \otimes B^{-1}$,
$(A \otimes B)^\top = A^\top \otimes B^\top$, and
$Y = C X B^\top \Leftrightarrow \operatorname{vec}(Y) = (B \otimes C) \operatorname{vec}(X)$, where
$\operatorname{vec}(X)$ is the stack of columns of $X$.

Reformulation We have $M = K \gamma G$, and thus $\operatorname{vec}(M)$ is the vector of predicted
values for all pairs $(x_i, y_j)$ and is equal to $\operatorname{vec}(M) = (K \otimes G) \operatorname{vec}(\gamma)$
(up to the ordering of the pairs induced by the vectorization). The matrix $K \otimes G$ is the kernel
matrix associated with all pairs, for the kernel $k_\otimes$. The results in this section do not provide
additional representational power beyond the kernel $k_\otimes$, but present additional regularization
frameworks for tensor product spaces. In the next section, we tackle the representational part of our
contribution and show how the kernels $k$ and $g$ can be tailored to the task of matrix completion.
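
The Kronecker identities can be checked directly with NumPy. One point worth noting in this sketch (random symmetric matrices, hypothetical sizes) is that the order of the factors depends on the vectorization convention: with row-major flattening, which is the NumPy default, the matrix of predictions flattens to the Kronecker product of the row kernel with the column kernel applied to the flattened coefficients, whereas with column stacking the two factors are exchanged.

```python
import numpy as np

rng = np.random.default_rng(6)
n_x, n_y = 3, 4
Ax = rng.normal(size=(n_x, n_x))
Ay = rng.normal(size=(n_y, n_y))
K = Ax @ Ax.T + np.eye(n_x)           # symmetric, well-conditioned kernel matrices
G = Ay @ Ay.T + np.eye(n_y)
gamma = rng.normal(size=(n_x, n_y))
M = K @ gamma @ G

# Row-major flattening: vec_r(K gamma G) = (K kron G) vec_r(gamma).
row_major_ok = np.allclose(np.kron(K, G) @ gamma.ravel(), M.ravel())

# Column stacking: vec_c(K gamma G) = (G kron K) vec_c(gamma).
col_major_ok = np.allclose(
    np.kron(G, K) @ gamma.ravel(order="F"), M.ravel(order="F")
)

# Transpose and inverse act factor by factor.
transpose_ok = np.allclose(np.kron(K, G).T, np.kron(K.T, G.T))
inverse_ok = np.allclose(
    np.linalg.inv(np.kron(K, G)), np.kron(np.linalg.inv(K), np.linalg.inv(G))
)
```

Both conventions describe the same kernel matrix on all pairs, with the pairs listed in a different order.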

4 Kernels for matrix completion with attributes 

The three formulations (2), (3) and (7) differ in the way they handle regularization. They all require
a choice of kernel over $\mathcal{X}$ and $\mathcal{Y}$ to define the RKHS norm in $\mathcal{H}_\otimes$,
which we discuss in this section. The main theme of this section is the distinction between kernels
linked to the attributes and the kernel linked to the identities of each different $x$ and $y$
(referred to as Dirac kernels).

Dirac kernels In the standard collaborative filtering framework, where no attribute over $x$ or
$y$ is available, a natural kernel over $\mathcal{X}$ and $\mathcal{Y}$ is the Dirac kernel
($k_{\mathrm{Dirac}}(x, x') = 1$ if $x = x'$, 0 otherwise). If both $k$ and $g$ are Dirac kernels, then
$k_\otimes$ is also a Dirac kernel by (1) and the classical approach (2) is irrelevant in that case
(the function $f$ being equal to 0 on unseen examples). The low-rank constraint added in (3) results
in a relevant problem: in fact for $\lambda = 0^+$ we exactly recover the classical low-rank matrix
factorization problem.

Attribute kernels When attributes are available, they can be used to define kernels which we
denote $k_{\mathrm{Attribute}}$ below. When both $k$ and $g$ are kernels derived from attributes, then
(2) boils down to classical regression or classification over pairs. Problem (3) provides an
alternative problem, where the rank of the function is constrained.

Multi-task learning Suppose now that attributes are available only for objects in $\mathcal{X}$, and not
for $\mathcal{Y}$. It is then possible to take $k = k_{\mathrm{Attribute}}$ and $g = k_{\mathrm{Dirac}}$.
In that case the optimization problem (2) boils down to solving a classical classification or regression
problem for each value $y_i$ independently. Adding the rank constraint in (3) removes the independence
among tasks by enforcing a decomposition of the tasks into a limited number of factors, which leads to
an algorithm for multi-task learning, based on a low-rank representation of the predictor function.
This approach is to be contrasted with the framework of [7], which is equivalent to $\mathcal{X}$ finite
of cardinality $p$, and $k(x, x') = 1 - \lambda + \lambda p\, k_{\mathrm{Dirac}}(x, x')$. Our framework
focuses on a low-rank representation for the predictor of each task, while the framework of [7] focuses
on a set of predictors, one for each task, that has small variance. An extension of this multi-task
learning framework, leading to similar penalizations by
trace norms, was independently derived by Argyriou et al. [2].

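General formulation Supposing now that attributes are available on both $\mathcal{X}$ and $\mathcal{Y}$,
let us consider the following interpolated kernels:

$$k = \eta\, k^{x}_{\mathrm{Attribute}} + (1 - \eta)\, k^{x}_{\mathrm{Dirac}}, \qquad g = \zeta\, k^{y}_{\mathrm{Attribute}} + (1 - \zeta)\, k^{y}_{\mathrm{Dirac}},$$

where $0 \le \eta \le 1$ and $0 \le \zeta \le 1$. The resulting product kernel is a sum of four terms:

$$k_\otimes = \eta \zeta\, k^{x}_{\mathrm{Attribute}} k^{y}_{\mathrm{Attribute}} + \eta (1 - \zeta)\, k^{x}_{\mathrm{Attribute}} k^{y}_{\mathrm{Dirac}} + (1 - \eta) \zeta\, k^{x}_{\mathrm{Dirac}} k^{y}_{\mathrm{Attribute}} + (1 - \eta)(1 - \zeta)\, k^{x}_{\mathrm{Dirac}} k^{y}_{\mathrm{Dirac}}.$$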

By varying $\eta$ and $\zeta$ this kernel provides an interpolation between collaborative filtering
($\eta = \zeta = 0$), classical attribute-based regression on pairs ($\eta = \zeta = 1$), and multi-task
learning ($\eta = 0$ and $\zeta = 1$, or $\eta = 1$ and $\zeta = 0$).

In terms of kernel matrices on the set of all pairs, if we denote by $K_{\mathrm{Att}}$ the kernel matrix
associated with the attributes of $\mathcal{X}$, and by $G_{\mathrm{Att}}$ the kernel matrix associated
with the attributes of $\mathcal{Y}$, this is equivalent to using the matrix

$$\left( \eta K_{\mathrm{Att}} + (1 - \eta) I \right) \otimes \left( \zeta G_{\mathrm{Att}} + (1 - \zeta) I \right) = \eta \zeta\, K_{\mathrm{Att}} \otimes G_{\mathrm{Att}} + \eta (1 - \zeta)\, K_{\mathrm{Att}} \otimes I + (1 - \eta) \zeta\, I \otimes G_{\mathrm{Att}} + (1 - \eta)(1 - \zeta)\, I \otimes I,$$

which is the sum of four positive kernel matrices. The first one is simply the kernel matrix for the
tensor product space, while the last one is proportional to the identity and usually appears in kernel
methods as the numerical effect of regularization [13]. The two additional matrices make learning
across rows and columns possible.
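
The four-term decomposition of the interpolated kernel matrix can be reproduced in a few lines. In this sketch the attribute kernel matrices are random low-rank positive semi-definite matrices and the interpolation weights are arbitrary values in [0, 1]; both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_x, n_y = 4, 3
eta, zeta = 0.15, 0.5

Ux = rng.normal(size=(n_x, 2))
Uy = rng.normal(size=(n_y, 2))
K_att, G_att = Ux @ Ux.T, Uy @ Uy.T    # hypothetical attribute kernel matrices
I_x, I_y = np.eye(n_x), np.eye(n_y)    # Dirac kernel matrices on the training points

K = eta * K_att + (1 - eta) * I_x
G = zeta * G_att + (1 - zeta) * I_y

four_terms = (
    eta * zeta * np.kron(K_att, G_att)
    + eta * (1 - zeta) * np.kron(K_att, I_y)
    + (1 - eta) * zeta * np.kron(I_x, G_att)
    + (1 - eta) * (1 - zeta) * np.kron(I_x, I_y)
)
expansion_ok = np.allclose(np.kron(K, G), four_terms)
min_eig = np.linalg.eigvalsh(np.kron(K, G)).min()
```

The identity term acts as a ridge: the smallest eigenvalue of the combined matrix is at least the product of the two Dirac weights.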



Generalization to new points One of the usual drawbacks of collaborative filtering is the
impossibility to generalize to unseen data points (i.e., a new movie or a new person in the context of
movie recommendation). When attributes are used, a prediction based on those can be made, and thus
using attributes has an added benefit beyond better performance on matrix completion tasks.
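
The following sketch illustrates prediction for an unseen user when the user kernel mixes an attribute kernel and a Dirac kernel. The linear attribute kernel, the random attributes and the random coefficient matrix standing in for learned coefficients are all assumptions; only the structure of the prediction matters here.

```python
import numpy as np

rng = np.random.default_rng(8)
n_x, n_y = 5, 4

user_attr = rng.normal(size=(n_x, 3))      # attributes of the training users
new_user_attr = rng.normal(size=3)         # attributes of an unseen user
Ay = rng.normal(size=(n_y, n_y))
G = Ay @ Ay.T                              # kernel matrix on the training objects
gamma = rng.normal(size=(n_x, n_y))        # stands in for learned coefficients

def new_user_predictions(eta):
    # Linear attribute kernel (an assumption). The Dirac part is 0 because the
    # new user differs from every training user.
    k_att = user_attr @ new_user_attr
    k_new = eta * k_att + (1 - eta) * np.zeros(n_x)
    # f(x_new, y_j) = sum_{i,l} gamma_il k(x_new, x_i) g(y_j, y_l) for each object y_j.
    return k_new @ gamma @ G

with_attributes = new_user_predictions(0.5)
pure_cf = new_user_predictions(0.0)        # Dirac kernel only
```

With the Dirac kernel alone the prediction for the new user is identically zero, whereas any positive weight on the attribute kernel yields informative predictions.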



5 Algorithms 

In this section, we describe the algorithms used for the optimization formulations in (5) and (8).
We also show that recent developments in multiple kernel learning can be applied to both settings
(enforced rank constraint or trace norm).



5.1 Fixed rank 

The function
$(\alpha, \beta) \mapsto \frac{1}{n} \sum_{u=1}^{n} \ell\left( \sum_{k=1}^{p} (K\alpha_k)_{i(u)} (G\beta_k)_{j(u)},\ z_u \right) + \lambda \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i^\top K \alpha_j\, \beta_i^\top G \beta_j$
is convex in each argument separately but is not jointly convex. There are thus two natural optimization
algorithms: (1) alternate convex minimization with respect to $\alpha$ and $\beta$, and (2) direct joint
minimization using quasi-Newton iterative methods [5] (in simulations we have used the latter scheme).

As in [12], the fixed rank formulation, although not convex, has the advantage of being parameterized
by low-rank matrices and is thus of much lower complexity than the convex formulation that we now
present. We present experimental results in the next section using only this fixed rank formulation.
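
As an illustration of scheme (1), alternate convex minimization, the sketch below performs exact block updates for the square loss, for which each block problem is a linear least-squares problem. This is not the quasi-Newton scheme used in the simulations; the data, kernel matrices, rank and regularization value are assumptions, and `solve_block` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(9)
n_x, n_y, p, n_obs, lam = 6, 5, 2, 20, 0.01

Ax = rng.normal(size=(n_x, n_x))
Ay = rng.normal(size=(n_y, n_y))
K = Ax @ Ax.T / n_x + np.eye(n_x)
G = Ay @ Ay.T / n_y + np.eye(n_y)
obs = rng.choice(n_x * n_y, size=n_obs, replace=False)
i_idx, j_idx = np.divmod(obs, n_y)
z = rng.normal(size=n_obs)

def objective(alpha, beta):
    F = (K @ alpha) @ (G @ beta).T
    fit = np.mean((F[i_idx, j_idx] - z) ** 2)
    return fit + lam * np.sum((alpha.T @ K @ alpha) * (beta.T @ G @ beta))

def solve_block(K_a, rows_a, other_feats, other_gram):
    """Exact minimization over one factor with the other one fixed.

    With beta fixed, the prediction for observation u is linear in alpha:
    <kron(K[i(u)], (G beta)[j(u)]), vec(alpha)> (row-major vec), and the
    penalty is vec(alpha)^T (K kron beta^T G beta) vec(alpha).
    """
    Phi = np.stack([np.kron(K_a[i], b) for i, b in zip(rows_a, other_feats)])
    H = Phi.T @ Phi / n_obs + lam * np.kron(K_a, other_gram)
    rhs = Phi.T @ z / n_obs
    sol = np.linalg.lstsq(H, rhs, rcond=None)[0]
    return sol.reshape(K_a.shape[0], p)

alpha = rng.normal(size=(n_x, p))
beta = rng.normal(size=(n_y, p))
history = [objective(alpha, beta)]
for _ in range(10):
    B = G @ beta
    alpha = solve_block(K, i_idx, B[j_idx], beta.T @ G @ beta)
    A = K @ alpha
    beta = solve_block(G, j_idx, A[i_idx], alpha.T @ K @ alpha)
    history.append(objective(alpha, beta))
```

Because each block update is an exact minimization of a convex quadratic, the values in `history` are non-increasing, although the limit is only guaranteed to be a stationary point of the non-convex joint problem.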



5.2 Convex formulation 

If the loss $\ell$ is convex, then the function
$\gamma \mapsto \frac{1}{n} \sum_{u=1}^{n} \ell\left( (K \gamma G)_{i(u), j(u)},\ z_u \right) + \mu \|K \gamma G\|_* + \lambda \operatorname{tr} \gamma^\top K \gamma G$
is a convex function. However, even when the loss is differentiable, the trace norm is not
differentiable, which makes iterative descent methods such as Newton-Raphson not applicable [6].

For specific losses which are SDP-representable, i.e., which can be represented by a semi-definite
program, such as the square loss or the hinge loss, the minimization of this function can be cast as a
semi-definite program (SDP) [8]. For differentiable losses which are not SDP-representable, such as
the logistic loss, an efficient algorithm is to modify the trace norm to make it differentiable (see, e.g.,
[12]). For example, instead of penalizing the sum of the singular values $\lambda_i$, one may penalize
the sum of $\sqrt{\lambda_i^2 + \varepsilon^2}$, which leads to a twice differentiable function [10].
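
The effect of the smoothing can be quantified on a small example. The sketch below (random matrix, hypothetical function names) compares the trace norm with its smoothed version for decreasing values of the smoothing parameter; since the square root of s squared plus epsilon squared lies between s and s plus epsilon, the gap is at most epsilon times the number of singular values.

```python
import numpy as np

def trace_norm(M):
    """Sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

def smoothed_trace_norm(M, eps):
    """Sum of sqrt(s_i^2 + eps^2) over the singular values s_i of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.sqrt(s ** 2 + eps ** 2).sum()

rng = np.random.default_rng(10)
M = rng.normal(size=(5, 4))
exact = trace_norm(M)
eps_values = (1.0, 0.1, 0.01)
gaps = [smoothed_trace_norm(M, eps) - exact for eps in eps_values]
```

The gaps are non-negative and shrink as the smoothing parameter decreases, so the smoothed penalty is a differentiable upper bound that tightens towards the trace norm.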

5.3 Learning the kernels



In this section, we show that the rank constraint and the trace norm constraint can also be used in
the multiple kernel learning framework [3, 11]. We can indeed show that if the loss $\ell$ is convex,
then, as a function of the kernel matrices, the optimal values of the optimization problems (5) and (8)
are convex functions, and thus the kernel can be learned efficiently by minimizing those functions. We
do not use this method in our experiments; we include it only for completeness.

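Proposition 3 Given $\beta_1, \ldots, \beta_p$, the following function is convex in $K$:

$$K \mapsto \min_{\alpha \in \mathbb{R}^{n_{\mathcal{X}} \times p}} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( \sum_{k=1}^{p} (K\alpha_k)_{i(u)} (G\beta_k)_{j(u)},\ z_u \right) + \lambda \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i^\top K \alpha_j\, \beta_i^\top G \beta_j \right\}.$$

The following function only depends on the Kronecker product $K \otimes G$ and is a convex function of
$K \otimes G$:

$$(K, G) \mapsto \min_{\gamma \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}}} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( (K \gamma G)_{i(u), j(u)},\ z_u \right) + \mu \|K \gamma G\|_* + \lambda \operatorname{tr} \gamma^\top K \gamma G \right\}.$$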

This proposition (whose proof is in the technical annex) shows that in the case of the rank constraint
(5), if we parameterize $K$ as $K = \sum_j \eta_j K_j$, the weights $\eta$ and $\alpha$ can be learned
simultaneously [3, 11]. In particular, the optimal weighting between attributes and the Dirac kernel can
be learned directly from data. Note that a similar proposition holds when the roles of $x$ and $y$ are
exchanged; when alternate minimization is used to minimize the objective function, the kernels can be
learned at each step. In the case of the trace norm constraint (8), it shows that we can learn a linear
combination of basis kernels, either the four kernels presented earlier, obtained from Dirac and
attribute kernels, or more general combinations. We leave this avenue open for future research.



6 Experiments 

We tested the method on the well-known MovieLens 100k dataset from the GroupLens Research
Group at the University of Minnesota. This dataset consists of ratings of 1682 movies by 943 users.
Each rating is a score from 1 to 5, where 5 is the highest rating. Each user
rated some subset of the movies, with a minimum of 20 ratings per user, and the total number of
ratings available is exactly 100,000, averaging about 106 per user. To speed up the computation,
we used a random subsample of 800 movies and 400 users, for a total of 20541 ratings. We divided
this set into 18606 training ratings and 1935 for testing. This dataset was rather appropriate as it
included attribute information for both the movies and the users. Each movie was labelled with at
least one among 19 genres (e.g., action or adventure), while users' attributes include age, gender, and an
occupation among a list of 21 occupations (e.g., administrator or artist).

We performed experiments using the rank constraint described in (3) and we used the more standard
approach of cross-validation to choose kernel parameters. Thus, our method requires selection of
four parameters: the rank $d$ of the estimation matrix; the regularization parameter $\lambda$; and the
values $\eta, \zeta \in [0, 1]$, the tradeoff between the Dirac kernel and the attribute kernel, for the
users and movies respectively. The parameters $\lambda$ and $d$ both act as regularization parameters
and we choose them using cross-validation. The values $\eta, \zeta$ were chosen out of
$\{0, 0.15, 0.5, 0.85, 1\}$, the rank $d$ ranged over $\{50, 80, 130, 200\}$, and $\lambda$ was chosen
from $\{25, 5, 1, 0.2, 0.04\} \times 10^{-6}$.
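
For concreteness, the search grid and the size of the subsampled data can be written down as follows. This is only bookkeeping based on the values quoted in the text; the variable names are hypothetical and the training loop itself is not shown.

```python
from itertools import product

# Candidate values quoted in the text.
etas = zetas = (0, 0.15, 0.5, 0.85, 1)
ranks = (50, 80, 130, 200)
lambdas = tuple(c * 1e-6 for c in (25, 5, 1, 0.2, 0.04))

# Every (rank, lambda, eta, zeta) combination that cross-validation may visit.
grid = list(product(ranks, lambdas, etas, zetas))

# Size of the subsampled rating matrix used in the experiments.
n_train, n_test = 18606, 1935
n_users, n_movies = 400, 800
observed_fraction = (n_train + n_test) / (n_users * n_movies)
```

The full grid contains 500 configurations, and about 6.4% of the entries of the 400 by 800 rating matrix are observed.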

In Table 1, we show the performance for various choices of rank $d$ and for various values of $\eta$ and
$\zeta$, after selecting $\lambda$ in each case using cross-validation. We also show in bold the
performance for the parameters selected using cross-validation. Notice that performance is consistently
worse when $\eta$ and $\zeta$ are chosen at the corners than at values in the interior of
$[0, 1] \times [0, 1]$. We observed this to be true, in fact, not only when $d$ and $\lambda$ are chosen
by cross-validation, but for every choice of $d$ and $\lambda$. Figure 1 shows the test mean squared
error for $d$ and $\lambda$ selected at each point using cross-validation (Left), as well as (Right) that
for a fixed $d$ and $\lambda$ (the plot looks similar for any fixed values of $d$ and $\lambda$) over the
range of $\eta$ and $\zeta$. Observe that the performance is best in the interior of the
$(\eta, \zeta)$ area, and worsens as we get towards the edges and particularly at the corners.
This is what we might expect: at the corners we are no longer taking advantage of either attribute
information or ID information, either for the class of movies or for the class of users.

Also notice in Table 1 that, as expected, regularization through controlling the rank $d$ is indeed
important. Regularization through the parameter $\lambda$ is also necessary: for
$(d, \eta, \zeta) = (130, 0.15, 0.15)$ shown in Table 1, test performance is 1.0351 when
$\lambda = 0.2$, but is 1.1401 when $\lambda = 0.04$, and 1.1457 when $\lambda = 1$ (we observe such
changes in performance across values of $\lambda$ for all other choices of $d$, $\eta$, and $\zeta$).
Hence, it is important to balance both regularization terms. In fact, we use cross-validation
to select all parameters.





(η, ζ) =    (0,0)     (0,1)     (1,0)     (1,1)     (0.5,0.5)   (0.15,0.15)
d = 50      1.5391    1.6436    1.1999    1.1310    1.1106      1.0676
d = 80      1.5552    1.4008    1.2221    1.1138    1.0544      1.0478
d = 130     1.3294    1.3787    1.2315    1.0999    1.0611      1.0351
d = 200     1.3806    1.4234    1.2192    1.0818    1.0587      1.0596



Table 1: Mean squared test error for various values of $\eta$ and $\zeta$ and for four choices of rank $d$.
In each of these, $\lambda$ was chosen using cross-validation. Bold indicates the performance for the final
parameters selected.




Figure 1: Left: A plot of mean squared error as we vary $\eta$ and $\zeta$. We chose $d$ and $\lambda$
using cross-validation. Right: A plot of mean squared error as we vary $\eta$ and $\zeta$ (for a fixed
choice of $\lambda$ and $d$). In both cases we see the performance worsen at the extreme values when
either $\eta$ or $\zeta$ becomes 0.

7 Conclusion



We presented a method for solving a generalized matrix completion problem where we have attributes 
describing the matrix dimensions. Various approaches, such as standard low rank matrix completion, 
are special cases of our method, and preliminary experiments confirm the benefits of our method. 
An interesting direction of future research is to explore further the multi-task learning algorithm we 
obtained with low-rank constraint. On the theoretical side, a better understanding of the effects of 
norm and rank regularizations and their interaction would be helpful. 

A Rank of a function in a tensor product RKHS (Section 2.2)

An element of $\mathcal{H}_\otimes$ can always be decomposed as a possibly infinite sum of atomic terms of
the form $u(x) v(y)$ where $u \in \mathcal{K}$ and $v \in \mathcal{G}$. We define the rank
$\operatorname{rank}(f) \in \mathbb{N} \cup \{\infty\}$ of an element $f$ of $\mathcal{H}_\otimes$ as the
minimal number of atomic terms in any decomposition of $f$, i.e., $\operatorname{rank}(f)$ is the smallest
integer $p$ such that $f$ can be expanded as:

$$f(x, y) = \sum_{i=1}^{p} u_i(x) v_i(y),$$

for some functions $u_1, \ldots, u_p \in \mathcal{K}$ and $v_1, \ldots, v_p \in \mathcal{G}$, if such an
integer $p$ does exist (otherwise, the rank is infinite).

When the two RKHS are spaces of linear functions on a Euclidean space, the tensor product can be
identified with the space of bilinear forms on the product of the two Euclidean spaces, and the notion of
rank coincides with the usual notion of rank for matrices. An alternative characterization of
$\operatorname{rank}(f)$ in terms of the matrices $M_{ij} = f(x_i, y_j)$ is given by the following
proposition.

Proposition 4 $\operatorname{rank}(f)$ is equal to the supremum of the ranks of the matrices
$M_{ij} = f(x_i, y_j)$ over the choices of finite sets $x_1, \ldots, x_m \in \mathcal{X}$ and
$y_1, \ldots, y_q \in \mathcal{Y}$.

Proof If $f$ can be expanded as $\sum_{i=1}^{p} u_i \otimes v_i$, then for any choice of finite sets
$x_1, \ldots, x_m \in \mathcal{X}$ and $y_1, \ldots, y_q \in \mathcal{Y}$, the matrix
$M_{ij} = f(x_i, y_j)$ can be expanded as $M = \sum_{i=1}^{p} t_i w_i^\top$, where $t_{ij} = u_i(x_j)$ and
$w_{ij} = v_i(y_j)$, and has therefore a rank not larger than $p$. Taking the smallest possible $p$, this
shows that the rank of $M$ cannot be larger than the rank of $f$. Conversely, if
$f = \sum_{i=1}^{p} u_i \otimes v_i$ with $\operatorname{rank}(f) = p$, we need to show that there exist
two sets of points such that the corresponding matrix $M$ has rank at least $p$. We first observe that
both sets $(u_i)_{i=1,\ldots,p}$ and $(v_i)_{i=1,\ldots,p}$ form linearly independent families in
$\mathcal{K}$ and $\mathcal{G}$, respectively. Indeed, if for example
$u_p = \sum_{i=1}^{p-1} \lambda_i u_i$, then
$f(x, y) = \sum_{i=1}^{p-1} u_i(x) \left[ v_i + \lambda_i v_p \right](y)$, which contradicts the
hypothesis that $\operatorname{rank}(f) = p$. It is therefore possible to find two sets of points
$x_1, \ldots, x_p \in \mathcal{X}$ and $y_1, \ldots, y_p \in \mathcal{Y}$ such that both sets of vectors
$(t_i)_{i=1,\ldots,p}$ and $(w_i)_{i=1,\ldots,p}$ form linearly independent families in $\mathbb{R}^p$
(where $t_{ij} = u_i(x_j)$ and $w_{ij} = v_i(y_j)$). The matrix $M$ corresponding to these sets of points
has rank $p$. ■



B Representer theorem with rank constraint (Proof of Proposition 1)



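Here we show that the solution of:

$$\min_{f \in \mathcal{H}_\otimes,\ \operatorname{rank}(f) \le p} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( f(x_{i(u)}, y_{j(u)}), z_u \right) + \lambda \|f\|_\otimes^2 \right\} \quad (9)$$

can be reduced to a finite-dimensional optimization problem:

Proposition 5 The optimal solution of (9) can be written as $f = \sum_{i=1}^{p} u_i \otimes v_i$, where

$$u_i(x) = \sum_{l=1}^{n_{\mathcal{X}}} \alpha_{li}\, k(x_l, x) \quad \text{and} \quad v_i(y) = \sum_{l=1}^{n_{\mathcal{Y}}} \beta_{li}\, g(y_l, y), \quad i = 1, \ldots, p, \quad (10)$$

where $\alpha \in \mathbb{R}^{n_{\mathcal{X}} \times p}$ and $\beta \in \mathbb{R}^{n_{\mathcal{Y}} \times p}$.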

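Proof Let $\mathcal{H}_\otimes^S$ denote the subspace of $\mathcal{H}_\otimes$ spanned by the functions
$k_\otimes\left( (x_i, y_j), \cdot \right)$ for $i = 1, \ldots, n_{\mathcal{X}}$ and
$j = 1, \ldots, n_{\mathcal{Y}}$, and $\mathcal{H}_\otimes^\perp$ denote its orthogonal supplement
($\mathcal{H}_\otimes = \mathcal{H}_\otimes^S \oplus \mathcal{H}_\otimes^\perp$). Similarly, let
$\mathcal{K}^S$ and $\mathcal{G}^S$ denote respectively the subspaces of $\mathcal{K}$ and $\mathcal{G}$
spanned by the functions $k(x_i, \cdot)$ for $i = 1, \ldots, n_{\mathcal{X}}$ (resp. $g(y_i, \cdot)$
for $i = 1, \ldots, n_{\mathcal{Y}}$), and $\mathcal{K}^\perp$ and $\mathcal{G}^\perp$ the corresponding
orthogonal supplements in $\mathcal{K}$ and $\mathcal{G}$. Any function $f \in \mathcal{H}_\otimes$ of
rank at most $p$ can be expanded as $f = \sum_{i=1}^{p} u_i \otimes v_i$, with $u_i \in \mathcal{K}$ and
$v_i \in \mathcal{G}$. Now, denote by $u_i = u_i^S + u_i^\perp$ the unique decomposition of $u_i$ over
$\mathcal{K} = \mathcal{K}^S \oplus \mathcal{K}^\perp$, and define similarly $v_i = v_i^S + v_i^\perp$
with $v_i^S \in \mathcal{G}^S$ and $v_i^\perp \in \mathcal{G}^\perp$. We therefore obtain:

$$f = \sum_{i=1}^{p} u_i^S \otimes v_i^S + \sum_{i=1}^{p} u_i^S \otimes v_i^\perp + \sum_{i=1}^{p} u_i^\perp \otimes v_i^S + \sum_{i=1}^{p} u_i^\perp \otimes v_i^\perp. \quad (11)$$

We now claim that the last three terms in (11) are in $\mathcal{H}_\otimes^\perp$. Indeed, taking for
example a term $u_i^S \otimes v_i^\perp$, we have
$u_i^S \otimes v_i^\perp (x_j, y_l) = u_i^S(x_j)\, v_i^\perp(y_l) = 0$ because, by the reproducing
property, $v_i^\perp(y_l) = \langle v_i^\perp, g(y_l, \cdot) \rangle_{\mathcal{G}} = 0$, and therefore
$u_i^S \otimes v_i^\perp \in \mathcal{H}_\otimes^\perp$. A similar computation shows that
$u_i^\perp \otimes v_i^S$ and $u_i^\perp \otimes v_i^\perp$ are both in $\mathcal{H}_\otimes^\perp$. On
the other hand, because $u_i^S \in \mathcal{K}^S$ and $v_i^S \in \mathcal{G}^S$, one easily gets that
$u_i^S \otimes v_i^S \in \mathcal{H}_\otimes^S$. Therefore $\sum_{i=1}^{p} u_i^S \otimes v_i^S$ is the
orthogonal projection of $f$ onto $\mathcal{H}_\otimes^S$ and is of rank at most $p$. We can conclude as
for the classical representer theorem that if $f$ is not equal to $\sum_{i=1}^{p} u_i^S \otimes v_i^S$,
then $\sum_{i=1}^{p} u_i^S \otimes v_i^S$ provides a function of $\mathcal{H}_\otimes$ of rank at most
$p$ with a strictly smaller functional value, leading to a contradiction. ■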



C Representer theorems and trace norm (proof of Proposition 2)

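Here we show a representer theorem for the solution of:

$$\min_{f \in \mathcal{H}_\otimes} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( f(x_{i(u)}, y_{j(u)}), z_u \right) + \mu \left\| \left( f(x_i, y_j) \right)_{i,j} \right\|_* + \lambda \|f\|_\otimes^2 \right\}. \quad (12)$$

Proposition 6 The optimal solution of (12) can be written in the form
$f(x, y) = \sum_{i=1}^{n_{\mathcal{X}}} \sum_{j=1}^{n_{\mathcal{Y}}} \gamma_{ij}\, k(x, x_i)\, g(y, y_j)$,
where $\gamma \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}}$.

Proof Our objective function is the sum of a function of the values of $f$ at all pairs $(x_i, y_j)$ plus
a squared RKHS norm. The usual representer theorem in the RKHS associated with $k_\otimes$ then applies,
and we get a solution of the form
$f(x, y) = \sum_{i,j} \gamma_{ij}\, k_\otimes\left( (x, y), (x_i, y_j) \right) = \sum_{i=1}^{n_{\mathcal{X}}} \sum_{j=1}^{n_{\mathcal{Y}}} \gamma_{ij}\, k(x, x_i)\, g(y, y_j)$,
where $\gamma \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}}$. ■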



D Learning the kernels (proof of Proposition 3) 

In this section we prove that if the loss $\ell$ is convex, then, as a function of the kernel matrices,
the optimal values of the optimization problems proposed in the paper are convex functions.

Proposition 7 Given $\beta_1, \ldots, \beta_p$, the function

$$K \mapsto \min_{\alpha \in \mathbb{R}^{n_{\mathcal{X}} \times p}} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( \sum_{k=1}^{p} (K\alpha_k)_{i(u)} (G\beta_k)_{j(u)},\ z_u \right) + \lambda \sum_{i=1}^{p} \sum_{j=1}^{p} \alpha_i^\top K \alpha_j\, \beta_i^\top G \beta_j \right\}$$

is convex in $K$.

Proof Given $\beta$, the objective function is convex in $\alpha$ and thus (under appropriate classical
conditions) the minimum value is equal to the maximum value of the dual problem, obtained by adding
variables $q_u$ and adding constraints $q_u = \sum_{k=1}^{p} (K\alpha_k)_{i(u)} (G\beta_k)_{j(u)}$,
together with the appropriate Lagrange multipliers. The result follows from derivations obtained in
[3, 11]. ■



Proposition 8 The function

$$H : (K, G) \mapsto \min_{\gamma \in \mathbb{R}^{n_{\mathcal{X}} \times n_{\mathcal{Y}}}} \left\{ \frac{1}{n} \sum_{u=1}^{n} \ell\left( (K \gamma G)_{i(u), j(u)},\ z_u \right) + \mu \|K \gamma G\|_* + \lambda \operatorname{tr} \gamma^\top K \gamma G \right\} \quad (13)$$

only depends on the Kronecker product $K \otimes G$ and is a convex function of $K \otimes G$.

Proof The objective function in (13) was originally obtained from (12), which is the sum of a term that is
a convex function of the values of a function $f \in \mathcal{H}_\otimes$ at all pairs $(x_i, y_j)$, and
the squared norm $\|f\|_\otimes^2$. The results of [11] apply to this case and thus the minimum value is a
convex function of the kernel matrix for all pairs $(x_i, y_j)$ and the kernel $k_\otimes$, which is
exactly $K \otimes G$. ■